NLP-Supported Full-Text Retrieval
Abstract
The amount of information available in electronic form is growing exponentially, making it increasingly difficult to find the desired information. This is especially true of the World Wide Web, which has no central administration and thus no ordering scheme to help users find the information they need. Furthermore, most of the information is narrative, i.e., in the form of unstructured documents written in natural languages, as opposed to structured information stored in databases. Information retrieval is primarily concerned with the storage and retrieval of unstructured information. Thus, along with the growth of the World Wide Web, information retrieval systems gain importance, since they are often the only way to find the few documents actually relevant to a specific question in the vast quantities of text available. Internet search engines like AltaVista or Lycos are very popular and commercially successful. Although information retrieval systems mainly deal with natural language, linguistic methods are rarely used. Most systems only use stemming, i.e., the mechanical cutting off of inflectional and derivational suffixes to better match index terms to query terms. Since most research on information retrieval is done for English, which has a relatively weak morphology, this is seldom regarded as problematic. Some researchers even consider stemming completely unnecessary. There is, however, considerable evidence that stemming and more linguistically motivated methods do have a positive impact on retrieval performance for languages such as Dutch, German, Italian, or Slovene, which are morphologically richer than English. Morphologic phenomena like compounds and changes of the stem are still not handled by conventional stemmers.
As German, for example, makes extensive use of these morphologic processes (consider compounds like Bundesverfassungsgericht, and changes of the stem as in Häuser, the plural of Haus), the application of full morphologic analysis to the information retrieval task intuitively seems promising. This thesis sets out to determine the usefulness of morphologic analysis in information retrieval systems, particularly for the retrieval of German-language documents. An experimental retrieval system called IRF/1 was developed as a test bed and is described in this thesis. IRF/1 is used to compare the retrieval effectiveness of different text processing methods on a test collection of about 300 magazine articles. The evaluated methods are: 1. stemming (as a baseline), 2. base form reduction using morphologic analysis, 3. same as (2) but compounds are split into the base forms of their constituents, and 4. same …
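The contrast between the first three methods can be illustrated with a minimal sketch. It is not the code of IRF/1: the suffix list, the function names, and the tiny hand-written lexicon are all assumptions made for illustration, standing in for a real stemmer and a real morphologic analyzer.

```python
# Toy illustration only; none of these names or tables come from IRF/1.

# Assumed, heavily simplified list of German inflectional suffixes.
SUFFIXES = ("ern", "en", "er", "es", "e", "n", "s")


def naive_stem(word: str) -> str:
    """Method 1 (sketch): cut off the longest matching suffix,
    keeping at least three characters of the word."""
    lower = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


# Hypothetical mini-lexicon standing in for full morphologic analysis.
BASE_FORMS = {"häuser": "haus", "haus": "haus"}
COMPOUND_PARTS = {
    "bundesverfassungsgericht": ["bund", "verfassung", "gericht"],
}


def base_form(word: str) -> str:
    """Method 2 (sketch): look up the base form; fall back to the word itself."""
    lower = word.lower()
    return BASE_FORMS.get(lower, lower)


def split_compound(word: str) -> list[str]:
    """Method 3 (sketch): return the base forms of a compound's constituents,
    or the base form of the word if it is not a known compound."""
    parts = COMPOUND_PARTS.get(word.lower())
    return parts if parts is not None else [base_form(word)]


if __name__ == "__main__":
    # Suffix stripping cannot undo the stem change (umlaut) in the plural.
    print(naive_stem("Häuser"), naive_stem("Haus"))
    # Lexicon-based base form reduction maps both to the same index term.
    print(base_form("Häuser"), base_form("Haus"))
    # Compound splitting exposes the constituents as separate index terms.
    print(split_compound("Bundesverfassungsgericht"))
```

In this sketch the suffix stripper produces different index terms for Häuser and Haus, whereas the lexicon lookup maps both to the same base form, which is the kind of mismatch the abstract attributes to conventional stemmers.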
Related resources
What is the role of NLP in text retrieval?
This paper addresses the value of linguistically-motivated indexing (LMI) for document and text retrieval. After reviewing the basic concepts involved and the assumptions on which LMI is based, namely that complex index descriptions and terms are necessary, I consider past and recent research on LMI, and specifically on automated LMI via NLP. Experiments in the first phase of research, to the l...
NLP-NG - A New NLP System for Biomedical Text Analysis
NLP-NG is a new NLP system consisting of three components: NG-CORE (language processing), NG-DB (database management), and NG-SEE (interactive visualization and entry). The ultimate goal of NLP-NG is to produce information retrieval systems in which users can choose full-text schema, adding specific items to focus their queries. Schema are created by a normalization process which elides adjunct...
Natural Language in Information Retrieval
It seems the time is ripe for the two to meet: NLP has grown out of prototypes and IR is having a hard time trying to improve precision. Two examples of possible approaches are considered below. Lexware is a lexicon-based system for text analysis of Swedish applied in an information retrieval task. NLIR is an information retrieval system using intensive natural language processing to provide index...
Challenges for extracting biomedical knowledge from full text
At present, most biomedical Information Retrieval and Extraction tools process abstracts rather than full-text articles. The increasing availability of full text will allow more knowledge to be extracted with greater reliability. To investigate the challenges of full-text processing, we manually annotated a corpus of cited articles from a Molecular Interaction Map (Kohn, 1999). Our analysis dem...
A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
This paper presents a comprehensive NLP system by Melingo that has recently been developed for Arabic, based on Morfix, a previously developed, operational, and highly successful comprehensive Hebrew NLP system. The system discussed includes modules for morphological analysis, context-sensitive lemmatization, vocalization, text-to-phoneme conversion, and syntactic-analysis-based prosody (intonation) ...
Role of Natural Language Processing in Information Retrieval; Challenges and Opportunities
This paper aims to analyze the role of natural language processing (NLP). The paper will discuss the role in the context of automated data retrieval, automated question answer, and text structuring. NLP techniques are gaining wider acceptance in real life applications and industrial concerns. There are various complexities involved in processing the text of natural language that could satisfy t...